In this R Markdown document, we will describe the process we used to
analyze this marketing budget dataset in order to detect trends and
identify how does the amount of each budget affect the sales.
Data
Source: Kaggle
The first step in our process is to load the different packages we will need into R. We use pacman to load the other packages because it does automatically install them too if they’re messing, which is more convenient. We also set the directory we will be working with using the setwd() function.
pacman::p_load(pacman, caret, lars, tidyverse, psych, lmtest)
setwd("C:/Users/Hamza/Desktop/Projects/Marketing Budget")
getwd()
## [1] "C:/Users/Hamza/Desktop/Projects/Marketing Budget"
Afterwards, we load and open our data
budget <- read.csv("Dummy Data HSS Clean.csv")
head(budget)
The describe function helps up get a general idea about our data by calculating some descriptive measurements. Note that Influencer is a categorical variable,
describe(budget)
Some of the conclusions we can draw from this data:
For this part, we will use 2 methods to verify correlation. First, we will plot the data to visually detect any trends in the data. Then we will calculate the correlation coefficient to be more precise.
plot(budget$TV, budget$Sales,
col = "purple",
main = "Budget: sales vs. tv budget",
xlab = "TV",
ylab = "Sales")
plot(budget$Radio, budget$Sales,
col = "blue",
main = "Budget: sales vs. radio budget",
xlab = "Radio",
ylab = "Sales")
plot(budget$Social_Media, budget$Sales,
col = "green",
main = "Budget: sales vs. Social media budget",
xlab = "Social_media",
ylab = "Sales")
In these plots, we can already notice that there is a very strong correlation between TV budget and sales. There also some correlation between Radio budget and sales. While when it comes to social media, the correlation is less apparent.
Now we try to calculate the coefficient of correlation between sales and the different variables.
budget %>% summarise(cor(Sales,TV),cor(Sales,Radio),cor(Sales,Social_Media))
The values of the coefficients confirm our observations from the plots.
In the next step, we try to build a regression model that tells us the contribution of each budget to the sales. The sales are considered the dependent variable, while TV, Radio, And Social_media are our independent variables.
First of all, we start by building the model using the lm function.
reg1 <- lm(budget$Sales ~ budget$TV + budget$Radio + budget$Social_Media)
# Inferential tests
summary(reg1)
##
## Call:
## lm(formula = budget$Sales ~ budget$TV + budget$Radio + budget$Social_Media)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6158 -2.0002 -0.0075 2.0199 11.2581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.133963 0.102820 -1.303 0.193
## budget$TV 3.562570 0.003389 1051.118 <2e-16 ***
## budget$Radio -0.003970 0.009781 -0.406 0.685
## budget$Social_Media 0.004964 0.024884 0.199 0.842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.95 on 4542 degrees of freedom
## Multiple R-squared: 0.999, Adjusted R-squared: 0.999
## F-statistic: 1.505e+06 on 3 and 4542 DF, p-value: < 2.2e-16
Based on the results of the inferential tests, we can conclude the following:
Now we will create another regression model using only TV as our independent variable and compare it to the previous one.
reg2 <- lm(budget$Sales ~ budget$TV)
summary(reg2)
##
## Call:
## lm(formula = budget$Sales ~ budget$TV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6062 -2.0062 -0.0125 2.0249 11.2566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.132493 0.100605 -1.317 0.188
## budget$TV 3.561514 0.001676 2125.272 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.949 on 4544 degrees of freedom
## Multiple R-squared: 0.999, Adjusted R-squared: 0.999
## F-statistic: 4.517e+06 on 1 and 4544 DF, p-value: < 2.2e-16
By comparing the 2 models, we notice that The standard error is slightly less in the second model, the R-squared values are identical and the p-value is still less than 0.05. So it is easier to use the second model over the first one.
In order for our interpretations of the results from the model to be significant, there are a number of assumptions that we need to check.
This one can be easily verified using the plot between TV budget and sales. And as we can see, there is a very strong linear relationship between the 2.
This means that the error needs to follow a normal distribution. We can verify this using a histogram of the residuals
hist(residuals(reg2))
As we can see in the following histogram, the residuals do follow a normal distribution.
In order for our model to be significant, the variance of the residuals needs to be the same across all values of the predicted variable, which is called homoscedasticity. One of the ways to verify for this is by performing the Breusch–Pagan Test.
lmtest::bptest(reg2)
##
## studentized Breusch-Pagan test
##
## data: reg2
## BP = 5.9269e-05, df = 1, p-value = 0.9939
According to the Breusch-Pagan test, the BP value is small, and the p-value is over 0.05. Therefore, we do not reject the null hypothesis, and we assume that the residuals are homoscedastic.
Since our model passed all these tests, we can say with confidence that our results are statistically significant. Now all we have left is to plot the model.
plot(x=budget$TV, y=budget$Sales,col = 'green', main='The relationship between TV marketing budget and Sales', xlab='TV Budget', ylab='Sales')
abline(reg2,col='purple')
Based on this model, we can conclude that for the current customer segments, TV is the optimal way to promote our products, since the more we invest into this communication channel, the bigger our sales are, and we established causation from this correlation.
These results make us ask the following questions: